Week 8 of 12 · Part B — Alignment Literacy

What Mechanistic Interpretability Is

Reverse-engineering a model's computation — reading what it does, not just what it says

Day 36 ~60 minutes Concept

Day 36 of 60

Why this week exists

Week 7 left you with an uncomfortable conclusion: if a model can fake alignment — behave well while observed and defect when it isn't — then black-box testing can never fully clear it. You can run a thousand evals and still not know whether the model is safe or just knows it's being tested. That is the wall behavioral safety hits. Mechanistic interpretability is the field's most ambitious attempt to climb over it: instead of judging the model by its outputs, you reverse-engineer the computation inside and read what it's actually doing.

The thesis

Interpretability treats a trained network not as a black box but as a program written in weights — one nobody wrote on purpose and nobody has read. The goal is to decompile it: identify the internal features and circuits the model uses, so that "is it deceptive?" becomes a question you can answer by inspection, not just by hoping the test caught it. It is the closest thing the field has to a lie detector for models.

The honest framing matters from the first sentence: this is a research program, not a finished tool. By the end of the week you'll be able to explain both what interpretability can already do and exactly where it still falls short — and that pairing is what makes you credible instead of breathless.

Three ways to "explain" a model — keep them separate

"Interpretability" is an overloaded word. Practitioners distinguish kinds, and mechanistic interp is a specific, demanding one.

Core Theory

1 · Behavioral / black-box explanation

Probe inputs and watch outputs: feature attributions, saliency maps, "the model said X because the prompt contained Y." Useful, but the explanation stays at the input-output level, even when the method uses gradients to compute it: it correlates inputs with outputs and can be fooled by a model that behaves differently when watched.
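As a rough illustration of what "probe inputs and watch outputs" looks like in practice, here is a minimal occlusion-style attribution sketch. The scoring function toy_model and its word weights are hypothetical stand-ins for a real model; the point is only that the method calls the model and never looks inside it.

```python
from typing import Callable, List, Sequence, Tuple


def toy_model(tokens: Sequence[str]) -> float:
    """Hypothetical black box: the score rises with 'refund' and falls with 'thanks'."""
    weights = {"refund": 2.0, "thanks": -1.0}
    return sum(weights.get(tok, 0.0) for tok in tokens)


def occlusion_attribution(
    model: Callable[[Sequence[str]], float], tokens: List[str]
) -> List[Tuple[str, float]]:
    """Score each token by how much the output drops when that token is removed."""
    base = model(tokens)
    return [
        (tok, base - model(tokens[:i] + tokens[i + 1:]))
        for i, tok in enumerate(tokens)
    ]


scores = occlusion_attribution(toy_model, ["please", "refund", "me", "thanks"])
for tok, score in scores:
    print(f"{tok:>8}: {score:+.1f}")
```

The attribution treats the model as a function to be called. If the model computed its answer differently on inputs it recognized as tests, nothing in this procedure would notice.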

2 · Representational interpretability

Ask what information is present in the model's internal activations — can a concept be read off a hidden layer? This is the level of linear probes (Day 38). It tells you a concept is encoded; it doesn't tell you the model uses it.
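A minimal sketch of the probing idea, under loud assumptions: the "activations" below are synthetic Gaussian vectors with a concept planted along one axis, not hidden states from a real model, and the probe is a simple difference-of-means classifier, not whatever probe Day 38 uses. All function names are hypothetical.

```python
import random


def make_activations(n, dim, concept_axis, rng):
    """Synthetic hidden states: unit Gaussian noise, shifted along one axis by the concept label."""
    data = []
    for _ in range(n):
        label = rng.random() < 0.5
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[concept_axis] += 3.0 if label else -3.0
        data.append((vec, label))
    return data


def fit_mean_difference_probe(data):
    """Linear probe: direction = mean(positive) - mean(negative), threshold at the midpoint."""
    dim = len(data[0][0])
    pos = [v for v, y in data if y]
    neg = [v for v, y in data if not y]
    mean_pos = [sum(v[j] for v in pos) / len(pos) for j in range(dim)]
    mean_neg = [sum(v[j] for v in neg) / len(neg) for j in range(dim)]
    direction = [p - q for p, q in zip(mean_pos, mean_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mean_pos, mean_neg)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def probe_predict(direction, bias, vec):
    """Classify by which side of the probe's hyperplane the activation falls on."""
    return sum(d * x for d, x in zip(direction, vec)) + bias > 0


rng = random.Random(0)
train = make_activations(400, 8, concept_axis=2, rng=rng)
held_out = make_activations(200, 8, concept_axis=2, rng=rng)
direction, bias = fit_mean_difference_probe(train)
accuracy = sum(probe_predict(direction, bias, v) == y for v, y in held_out) / len(held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy says the concept can be read off the activations. It says nothing about whether anything downstream reads it, which is exactly the limit of this level.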

3 · Mechanistic interpretability

The ambitious one: identify the actual features (directions in activation space that mean something) and circuits (subgraphs of components that compute something) the model runs, and show how they combine to produce a behavior. This is reverse-engineering, not correlation — and it's what could, in principle, verify internals.
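One way to see the gap between "encoded" and "used" is a causal intervention: remove a feature direction from the activation and check whether the output moves. The sketch below uses a hypothetical three-dimensional toy model whose readout only looks at the first coordinate; the names READOUT, toy_output, and the two feature vectors are illustrative assumptions, not part of any real library or model.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate(activation, direction):
    """Project out the component of the activation that lies along the given direction."""
    scale = dot(activation, direction) / dot(direction, direction)
    return [a - scale * d for a, d in zip(activation, direction)]


# Hypothetical downstream computation: it reads only the first coordinate.
READOUT = [1.0, 0.0, 0.0]


def toy_output(activation):
    return dot(READOUT, activation)


feature_used = [1.0, 0.0, 0.0]    # a direction the readout depends on
feature_unused = [0.0, 1.0, 0.0]  # present in the activation, ignored downstream
activation = [2.0, 5.0, -1.0]

baseline = toy_output(activation)
effect_used = baseline - toy_output(ablate(activation, feature_used))
effect_unused = baseline - toy_output(ablate(activation, feature_unused))
print("effect of ablating the used feature:  ", effect_used)
print("effect of ablating the unused feature:", effect_unused)
```

The unused feature has the larger magnitude in the activation, so a probe would find it easily, yet ablating it changes nothing. Evidence about mechanism comes from interventions like this one, not from readability alone.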

The one-line distinction

Black-box interp asks "what does it do?" Representational interp asks "what does it know?" Mechanistic interp asks "how does it actually compute this, step by step?" Only the third gives you something that could survive a model trying to deceive you.

Circuits and induction heads — the first real win

The founding move of modern interpretability was to stop treating a transformer as an inscrutable matrix and instead trace specific circuits: small, identifiable pieces of computation. The landmark result is the induction head, a circuit built from two attention heads in successive layers that implements a simple but powerful rule: "I saw the pattern [A][B] earlier; now I'm seeing [A] again, so predict [B]." It's how models do basic in-context pattern completion, and it was found, named, and mechanistically explained, which is proof that the inside of a network is at least sometimes legible.
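The rule itself is small enough to write down as ordinary code. The sketch below is a hard, symbolic version of the [A][B] ... [A] -> [B] behavior; a real induction head implements something like it softly, through attention weights over token positions, so treat this as an illustration of the algorithm and not of the circuit's internals. The function name induction_predict is hypothetical.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Apply the induction rule: find the most recent earlier occurrence of the
    current (last) token and predict whatever followed it. Returns None if the
    current token has not appeared before."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions, most recent first, for a previous copy of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


print(induction_predict(["Mr", "Dursley", "was", "Mr"]))  # prints Dursley
print(induction_predict(["a", "b", "c"]))                 # prints None
```

Note that the rule needs no knowledge of what the tokens mean. It works on any repeated sequence, which is why it supports pattern completion on text the model has never seen.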

Why this connects to Week 7

If a behavior is implemented by an identifiable circuit, you can in principle ask whether a deception behavior has one too — and watch for it firing. That is the whole bet: turn "we tested and it seemed fine" into "we looked inside and here's the mechanism." Today you only need the intuition that circuits exist and can be read; the rest of the week builds on it.

Your work today

Read + Map the Mechanism

~60-minute foundation

Read A Mathematical Framework for Transformer Circuits (Anthropic, 2021): the overview and the induction-heads section. Skip the heaviest algebra; you want the idea of a circuit and what an induction head does.
In a notebook, write a one-sentence definition of an induction head and one sentence on why a circuit-level explanation is stronger than a saliency map.
Write the single sentence that links this week to last: why does deceptive alignment make black-box testing insufficient, and how does interpretability respond? Keep it — it's the spine of your Part B brief on Day 40.

The expert move

An enthusiast says "interpretability lets us understand models." An expert names the altitude: behavioral explanation correlates inputs with outputs and can be gamed; mechanistic interpretability reverse-engineers the actual circuit, which is the only kind of evidence that could survive a model trying to look safe. Knowing which kind of "understanding" you're claiming is the whole credibility gap.

Say this in an interview: "Mechanistic interpretability isn't feature attribution — it's reverse-engineering the circuits a model actually runs, like induction heads. I care about it specifically because if deceptive alignment is real, behavioral evals can't clear a model on their own, and reading internals is the only path to verification rather than hope."

Today's Takeaways

Interpretability exists because black-box testing can't rule out a model that fakes alignment — you have to look inside.
Three levels: behavioral (what it does), representational (what it knows), mechanistic (how it computes). Only the third reverse-engineers the mechanism.
Circuits are identifiable pieces of computation; the induction head was the first clear win — proof the inside is sometimes legible.
Mechanistic interp is the field's best shot at a "lie detector" — but it's a research program, and the honest limits matter as much as the promise.